Screen%20Shot%202020-07-11%20at%206.36.11%20PM.png

Hotel Booking Cancelation Prediction Modeling

Xia Fu --07/10/2020

Introduction

This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. In this project, I built up model to predict the booking cancelation for each reservation, which is a binary classification problem.

Setup

In [14]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from dateutil.parser import parse
from scipy.stats import sem
from sklearn import preprocessing
import sklearn.metrics as sklmetrics
import warnings
warnings.filterwarnings('ignore')

1. Data review and come up with questions to ask

After reading the dataset, I generated the following questions that I am interested with this dataset and want to ask. I will conduct exploratory data analysis to get better understandings to them in next part.

QUESTIONS:

  1. Booking distribution channel: which channel could bring the most passengers to which hotel
  2. is_repeated_guest: which hotel has more repeated guest, this could tell the quality of the hotel
  3. reserved_room_type/assigned_room_type: does the guests offered their reserved room type?
  4. adr: the trend of daily price of the two hotels along the time.
  5. reservation_status: a comparison about the percentage of the reservation that was canceled by customers.
  6. market_segment
  7. country: who are the customers come from?
  8. hotel: which hotel comes more customers?
  9. arrival_date_month: customers traffic along with time.
In [15]:
df = pd.read_csv("hotel_bookings.csv")
In [16]:
print(df.shape)
df.head()
(119390, 32)
Out[16]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

In [19]:
# have a look of all the columns
df.columns.to_list()
Out[19]:
['hotel',
 'is_canceled',
 'lead_time',
 'arrival_date_year',
 'arrival_date_month',
 'arrival_date_week_number',
 'arrival_date_day_of_month',
 'stays_in_weekend_nights',
 'stays_in_week_nights',
 'adults',
 'children',
 'babies',
 'meal',
 'country',
 'market_segment',
 'distribution_channel',
 'is_repeated_guest',
 'previous_cancellations',
 'previous_bookings_not_canceled',
 'reserved_room_type',
 'assigned_room_type',
 'booking_changes',
 'deposit_type',
 'agent',
 'company',
 'days_in_waiting_list',
 'customer_type',
 'adr',
 'required_car_parking_spaces',
 'total_of_special_requests',
 'reservation_status',
 'reservation_status_date']

2. Data cleansing

I created a function to find the number of missing values for each variable and calculate the % of missing value. From the missing value table, there are 4 features that contains missing values.

  1. "Company" variable contains the most--94.3%, which I want to drop.
  2. "agent" is the ID of the travel agency that made the booking. I will fill ths NAs with -1
  3. "country" is the origin of passengers. I will delete the records with NAs in this column.
  4. "Children" is the number of children in the bookings. I will also delete the record with NAs.
In [20]:
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
In [21]:
missing_values_table(df)
Your selected dataframe has 32 columns.
There are 4 columns that have missing values.
Out[21]:
Missing Values % of Total Values
company 112593 94.3
agent 16340 13.7
country 488 0.4
children 4 0.0
In [22]:
# drop "company"
df2 = df.drop(["company"], axis =1)
In [23]:
df2.loc[df2["agent"].isnull(), "agent"]= -1
In [24]:
missing_values_table(df2)
Your selected dataframe has 31 columns.
There are 2 columns that have missing values.
Out[24]:
Missing Values % of Total Values
country 488 0.4
children 4 0.0
In [25]:
df2 = df2.dropna()
print(df2.shape)
df2.head()
(118898, 31)
Out[25]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... booking_changes deposit_type agent days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... 3 No Deposit -1.0 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... 4 No Deposit -1.0 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... 0 No Deposit -1.0 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... 0 No Deposit 304.0 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... 0 No Deposit 240.0 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 31 columns

In [26]:
missing_values_table(df2)
Your selected dataframe has 31 columns.
There are 0 columns that have missing values.
Out[26]:
Missing Values % of Total Values

3. EDA

I conducted some visualizations in order to get better understanding of this dataset.

In [7]:
# the number of bookings for each hotel
sns.countplot(x="hotel", data = df,color="skyblue")
plt.title('Frequency of Hotel'); plt.ylabel('Frequency');

The number of bookings for each month: there is an increase in booking number during spring and fall.

In [6]:
df_y = df
plt.figure(figsize=(10,5), dpi= 100)
month = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
sns.countplot(x="arrival_date_month", data = df.loc[df.arrival_date_year == 2016 ,: ],order = month, palette=("Blues_d"))
plt.title('Month'); plt.ylabel('Frequency');

The number of bookings through each distribution channel: TA/TO used as the distribution channel is the most

In [11]:
sns.countplot(x="distribution_channel", data = df,color="skyblue")
plt.title('Frequency of Hotel'); plt.ylabel('Frequency');

The number of repeated guest(1), and non-repeated guest(0) for each hotel

In [34]:
# take out only the repeated guest records
df_repeated = df.loc[df["is_repeated_guest"]==1, :]
ax = sns.countplot(x="hotel", hue = "is_repeated_guest", data = df,color="skyblue")
plt.title('Number of guest'); plt.ylabel('Frequency');
In [60]:
def without_hue(plot, feature):
    total = len(feature)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(percentage, (x, y), size = 12)
    plt.show()

Percentage of repeated guest for each hotel. City Hotel has more repeated guest than Resort Hotel.

In [36]:
ax = sns.countplot(x="hotel", data = df_repeated, color="skyblue")
plt.title('Percentage of repeated guest'); plt.ylabel('Frequency');
without_hue(ax, df.hotel)

Number of guest from each country. We can see that a large amount of people are from Portugal.

In [37]:
# group by country and count the number of records
df_country = df.groupby('country').size().reset_index(name='counts')
In [38]:
# only select the top 10 
df_country2 = df_country.nlargest(10, "counts")
In [39]:
ax = sns.barplot(x="country", y = "counts", data = df_country2, color="skyblue")
plt.title('Number of guest'); plt.ylabel('Frequency');

Then, I looked at the daily booking price and draw the time-series graph. We can see that the price is higher in summer than it is in winter.

In [62]:
# change the year, month, and day column to a single date column
df_date = df.copy(deep=False)
d = {"January":1, "February":2, "March":3, "April":4, "May":5, "June":6, "July":7, 
     "August":8, "September":9, "October":10, "November":11, "December":12 }

df_date.arrival_date_month = df_date.arrival_date_month.map(d)
df_date2 = pd.DataFrame({'year': df_date['arrival_date_year'].to_list(),
                   'month': df_date['arrival_date_month'].to_list(),
                   'day': df_date['arrival_date_day_of_month'].to_list()})
In [63]:
df_date["arrival_date"] = pd.to_datetime(df_date2[["year", "month", "day"]])
In [24]:
df_mean = df_date.groupby('arrival_date').adr.mean()


# Plot
plt.figure(figsize=(20,10), dpi= 80)
plt.ylabel("price", fontsize=16)  
plt.xlabel("date", fontsize=16)  
x = [d.date().strftime('%Y-%m-%d') for d in df_mean.index]
plt.plot(x, df_mean, color="red", lw=2) 
 
plt.axis([0, 800, 0, 200])

xtick_labels = [d.date().strftime('%Y-%m-%d')[-10:-3] for d in df_mean.index.tolist()[::30]]

df_sam = df_mean.reset_index()
xtick_location = df_sam.index.tolist()[::30]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=10, horizontalalignment='center', alpha=.7)

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)

plt.title("Daily Booking Price", fontsize=20)

# Axis limits
s, e = plt.gca().get_xlim()


# Draw Horizontal Tick lines  
for y in range(10, 200, 5):    
    plt.hlines(y, xmin=s, xmax=e,colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()

A comparison of the percentage of the reservation that was canceled by passengers for each hotel. City hotel has a higher percentage of cancelation.

In [58]:
# select the canceled records
df_cancel=df.loc[df.is_canceled == 1, :] 
print("percentage of cancelation", len(df_cancel)/len(df2))
percentage of cancelation 0.371949065585628
In [61]:
ax = sns.countplot(x="hotel", data = df_cancel, color="green")
plt.title('Canceled reservation'); plt.ylabel('Frequency');
without_hue(ax, df.hotel)

Linear relationship between the percentage of canceled reservation and lead time.

In [17]:
# lead time
# group by lead time and select only the canceled record
lead_cancel_data1 = df.groupby("lead_time")["reservation_status"].describe()
lead_cancel_data2 = lead_cancel_data1.loc[lead_cancel_data1.top == "Canceled", : ]
In [18]:
# create a new column called percentage, which is the rate of canceled booking over all the bookings with certain lead time point
lead_cancel_data2["percentage"] = (lead_cancel_data2["freq"]/lead_cancel_data2["count"] )* 100
In [60]:
plt.figure(figsize=(15,10))
sns.regplot(x=lead_cancel_data2.index.values, y = lead_cancel_data2.percentage.values)
plt.title("correlation between lead time and cancelation", fontsize = 20)
plt.xlabel("lead time", fontsize = 15)
plt.ylabel("cancelation percentage", fontsize = 15)
plt.ylim(20,110)
plt.show()

No obvious relations between the percentage of cancelation and time

In [90]:
df_cancel_count = df_date.groupby('arrival_date').is_canceled.sum()

df_cancel_count= df_cancel_count.to_frame()
df_total = df_date.groupby('arrival_date').is_canceled.count()
df_total= df_total.to_frame()

df_cancel_count['perc_cancel']= df_cancel_count["is_canceled"]/df_total["is_canceled"] * 100
df_cancel_count=df_cancel_count.drop(['is_canceled'], axis=1)
In [94]:
# Plot
plt.figure(figsize=(20,10), dpi= 80)
plt.ylabel("Percentage of cancelation", fontsize=16)  
plt.xlabel("date", fontsize=16)  
x = [d.date().strftime('%Y-%m-%d') for d in df_cancel_count.index]
plt.plot(x, df_cancel_count, color="blue", lw=2) 
 
plt.axis([0, 800, 0, 100])

xtick_labels = [d.date().strftime('%Y-%m-%d')[-10:-3] for d in df_cancel_count.index.tolist()[::30]]

df_sam = df_cancel_count.reset_index()
xtick_location = df_sam.index.tolist()[::30]
plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=10, horizontalalignment='center', alpha=.7)

# Decorations
# Lighten borders
plt.gca().spines["top"].set_alpha(0)
plt.gca().spines["bottom"].set_alpha(1)
plt.gca().spines["right"].set_alpha(0)
plt.gca().spines["left"].set_alpha(1)


# Axis limits
s, e = plt.gca().get_xlim()


# Draw Horizontal Tick lines  
for y in range(10, 200, 5):    
    plt.hlines(y, xmin=s, xmax=e,colors='black', alpha=0.5, linestyles="--", lw=0.5)

plt.show()

4. Feature Preparation and Selection

4.1 Dealing with Ordinal / Nominal variables

  1. Drop "reservation_status" because it has same meaning to "is_canceled" variable, which is considered as duplicated.
  2. For "country", "meal", "market segment", "distribution channel", "reserved_room_type", "assigned_room_type", use LB label to transform to numerical numbers to avoid too many columns and reduce the computation time.
  3. for " arrival_date_month ", assign the corresponding number of months converting it to numerical variables.
  4. for other ordinal/nominal varaibles, use get_dummies() function to conduct one-hot coding.
In [27]:
df3 = df2.drop(["reservation_status"], axis = 1)
df3 = df3.drop(["reservation_status_date"], axis = 1)
map = {
    "arrival_date_month": {
        "January": 1,
        "February": 2,
        "March": 3,
        "April": 4,
        "May": 5,
        "June": 6,
        "July": 7,
        "August": 8,
        "September": 9,
        "October": 10,
        "November": 11,
        "December":12
    }
}

df3 = df3.replace(map)
print(df3.shape)
df3.head()
(118898, 29)
Out[27]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... reserved_room_type assigned_room_type booking_changes deposit_type agent days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests
0 Resort Hotel 0 342 2015 7 27 1 0 0 2 ... C C 3 No Deposit -1.0 0 Transient 0.0 0 0
1 Resort Hotel 0 737 2015 7 27 1 0 0 2 ... C C 4 No Deposit -1.0 0 Transient 0.0 0 0
2 Resort Hotel 0 7 2015 7 27 1 0 1 1 ... A C 0 No Deposit -1.0 0 Transient 75.0 0 0
3 Resort Hotel 0 13 2015 7 27 1 0 1 1 ... A A 0 No Deposit 304.0 0 Transient 75.0 0 0
4 Resort Hotel 0 14 2015 7 27 1 0 2 2 ... A A 0 No Deposit 240.0 0 Transient 98.0 0 1

5 rows × 29 columns

In [28]:
le = preprocessing.LabelEncoder()
df3["meal"] = le.fit_transform(df3["meal"])
df3["country"] = le.fit_transform(df3["country"])
df3["market_segment"] = le.fit_transform(df3["market_segment"])
df3["distribution_channel"] = le.fit_transform(df3["distribution_channel"])
df3['reserved_room_type']= le.fit_transform(df3["reserved_room_type"])
df3['assigned_room_type']=le.fit_transform(df3["assigned_room_type"])
In [29]:
feat_dummy=["deposit_type", "customer_type", "hotel"]
In [30]:
df_dum = pd.get_dummies(df3[feat_dummy], drop_first=True)
df4 = pd.concat([df3, df_dum], axis=1)
df4 = df4.drop(feat_dummy, axis = 1)
print(df4.shape)
df4.head()
(118898, 32)
Out[30]:
is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children ... days_in_waiting_list adr required_car_parking_spaces total_of_special_requests deposit_type_Non Refund deposit_type_Refundable customer_type_Group customer_type_Transient customer_type_Transient-Party hotel_Resort Hotel
0 0 342 2015 7 27 1 0 0 2 0.0 ... 0 0.0 0 0 0 0 0 1 0 1
1 0 737 2015 7 27 1 0 0 2 0.0 ... 0 0.0 0 0 0 0 0 1 0 1
2 0 7 2015 7 27 1 0 1 1 0.0 ... 0 75.0 0 0 0 0 0 1 0 1
3 0 13 2015 7 27 1 0 1 1 0.0 ... 0 75.0 0 0 0 0 0 1 0 1
4 0 14 2015 7 27 1 0 2 2 0.0 ... 0 98.0 0 1 0 0 0 1 0 1

5 rows × 32 columns

4.2 Data Standardization

Standardize only those numerical variables: features_s

In [31]:
columns = df4.columns
features = columns.drop('is_canceled')
features_s = columns.drop('is_canceled').drop(df_dum.columns)
In [32]:
from sklearn.preprocessing import StandardScaler
sc =StandardScaler()
df4[features_s] =sc.fit_transform(df4[features_s])
print(df4.shape)
df4.head()
(118898, 32)
Out[32]:
is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children ... days_in_waiting_list adr required_car_parking_spaces total_of_special_requests deposit_type_Non Refund deposit_type_Refundable customer_type_Group customer_type_Transient customer_type_Transient-Party hotel_Resort Hotel
0 0 2.223407 -1.636365 0.144836 -0.012256 -1.685694 -0.932429 -1.316808 0.244755 -0.261059 ... -0.132201 -2.020440 -0.253449 -0.721208 0 0 0 1 0 1
1 0 5.918350 -1.636365 0.144836 -0.012256 -1.685694 -0.932429 -1.316808 0.244755 -0.261059 ... -0.132201 -2.020440 -0.253449 -0.721208 0 0 0 1 0 1
2 0 -0.910279 -1.636365 0.144836 -0.012256 -1.685694 -0.932429 -0.790536 -1.483635 -0.261059 ... -0.132201 -0.534870 -0.253449 -0.721208 0 0 0 1 0 1
3 0 -0.854153 -1.636365 0.144836 -0.012256 -1.685694 -0.932429 -0.790536 -1.483635 -0.261059 ... -0.132201 -0.534870 -0.253449 -0.721208 0 0 0 1 0 1
4 0 -0.844799 -1.636365 0.144836 -0.012256 -1.685694 -0.932429 -0.264264 0.244755 -0.261059 ... -0.132201 -0.079295 -0.253449 0.540344 0 0 0 1 0 1

5 rows × 32 columns

4.3 Feature Selection

4.2.1 PCA

Select ~20 features should be reasonable because 20 features can explain most of the variations.

In [33]:
df_x = df4[features]
df_y = df4["is_canceled"]
In [34]:
from sklearn.decomposition import PCA
pca = PCA().fit(df_x)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
Out[34]:
Text(0, 0.5, 'cumulative explained variance')

4.2.2 Step Backward Feature Selection

I selected 25 features from 32 using backward feature selection with random forest classifier, cross validation and scoring by ROC_AUC

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
sfs2 = SFS(RandomForestClassifier(), 
           k_features=25, 
           forward=False, 
           floating=False, 
           scoring='roc_auc',
           n_jobs = -1,
           cv=2)

sfs2= sfs2.fit(np.array(df_x.fillna(0)), df_y)
In [36]:
selected_feat= df_x.columns[list(sfs2.k_feature_idx_)]
selected_feat
Out[36]:
Index(['lead_time', 'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'agent',
       'days_in_waiting_list', 'required_car_parking_spaces',
       'total_of_special_requests', 'deposit_type_Non Refund',
       'deposit_type_Refundable', 'customer_type_Group',
       'customer_type_Transient-Party'],
      dtype='object')

4.2.3 Pearson Correlation Matrix

Identify any features that has high correlation with another and drop them.

In [37]:
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df4[selected_feat].corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d23bf1400>
In [38]:
drop_col = ['assigned_room_type','market_segment', 'reserved_room_type']
x_new = selected_feat.drop(drop_col)
x_new
Out[38]:
Index(['lead_time', 'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'booking_changes', 'agent', 'days_in_waiting_list',
       'required_car_parking_spaces', 'total_of_special_requests',
       'deposit_type_Non Refund', 'deposit_type_Refundable',
       'customer_type_Group', 'customer_type_Transient-Party'],
      dtype='object')

4.4 Split the data to traning set and validation set

In [39]:
X = df4[x_new]
Y = df4["is_canceled"]

from sklearn.model_selection import train_test_split 

train_X, test_X, train_y, test_y = train_test_split(X,Y,train_size=0.7, random_state = 1)
print ('Train Features: ',train_X.shape ,
      'Test Features: ',test_X.shape)
print ('Train Labels: ',train_y.shape ,
      'Test Labels: ',test_y.shape)
Train Features:  (83228, 22) Test Features:  (35670, 22)
Train Labels:  (83228,) Test Labels:  (35670,)
In [40]:
df_clean = pd.concat([X, Y], axis=1, sort=False)
df_clean.to_csv("df_clean")

5. Modeling

  1. Null Model

Make a null model before starting other models to have an idea of the benchmark.

Set all dependent variable values as 0 (none was canceled), which means the accuracy of my other models should not be lower than 0.6299.

In [52]:
y_null = np.zeros([len(test_y), 1])
In [62]:
print("Accuracy" , sklmetrics.accuracy_score(test_y, y_null))
roc_auc = roc_auc_score(test_y, y_null)
print("Area under the ROC curve : %f" % roc_auc)
Accuracy 0.6298850574712643
Area under the ROC curve : 0.500000

5.1 Logistic Regression

  1. default logistic regression: all the parameters are default
In [87]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='liblinear')

lr.fit(train_X, train_y)
predict_y = lr.predict(test_X)

print("Accuracy:", lr.score(test_X, test_y))
Accuracy: 0.7871881132604429
In [85]:
roc_auc = roc_auc_score(test_y, predict_y)
print("Area under the ROC curve : %f" % roc_auc)
Area under the ROC curve : 0.735966
In [86]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(test_y, predict_y, labels =[0,1])
print("confusion matrix")
conf_mat
Out[86]:
array([[20966,  1502],
       [ 6089,  7113]])
  1. Logistic regression with optimized parameter
In [66]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1, 5, 10, 20],
                            'penalty': [ 'l1', 'l2']}

grid_search = GridSearchCV(LogisticRegression(solver='liblinear'),  param_grid, cv=5)
grid_search.fit(train_X, train_y)
Out[66]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [1, 5, 10, 20], 'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [67]:
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_search.best_score_))
Best parameters: {'C': 10, 'penalty': 'l1'}
Best cross-validation score: 0.78757
In [68]:
print("Accuracy:", grid_search.score(test_X, test_y))
Accuracy: 0.7870759742080179
In [69]:
predict_y2 = grid_search.predict(test_X)
roc_auc = roc_auc_score(test_y, predict_y2)
print("Area under the ROC curve : %f" % roc_auc)
Area under the ROC curve : 0.735736
In [78]:
conf_mat = confusion_matrix(test_y, predict_y2, labels =[0,1])
print("confusion matrix")
print(conf_mat)
confusion matrix
[[20971  1497]
 [ 6098  7104]]

Logistic regression with optimized hyperparameter didn't effectively improve the accuracy.

  1. Decision Tree Classifier with optimized hyperparameter
In [109]:
from sklearn.tree import DecisionTreeClassifier
param = {'criterion':['gini', 'entropy'], 'max_depth':10.**np.arange(1, 4), 
        'max_leaf_nodes':[300, 350, 400, 450, 500]}
grid_dt = GridSearchCV(DecisionTreeClassifier(), param, cv=10)
grid_dt.fit(train_X, train_y)
Out[109]:
GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([  10.,  100., 1000.]),
                         'max_leaf_nodes': [300, 350, 400, 450, 500]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
In [110]:
print("Best parameters: {}".format(grid_dt.best_params_))
print("Best cross-validation score: {:.5f}".format(grid_dt.best_score_))
print("Accuracy:", grid_dt.score(test_X, test_y))
Best parameters: {'criterion': 'gini', 'max_depth': 1000.0, 'max_leaf_nodes': 500}
Best cross-validation score: 0.85814
Accuracy: 0.855228483319316
In [111]:
predict_y3= grid_dt.predict(test_X)
roc_auc = roc_auc_score(test_y, predict_y3)
print("Area under the ROC curve : %f" % roc_auc)
Area under the ROC curve : 0.835974
In [112]:
conf_mat = confusion_matrix(test_y, predict_y3, labels =[0,1])
print("confusion matrix")
conf_mat
confusion matrix
Out[112]:
array([[20448,  2020],
       [ 3144, 10058]])
In [113]:
dtc = DecisionTreeClassifier(criterion = 'gini', max_depth=1000, max_leaf_nodes = 350)
dtc.fit(train_X, train_y)
fi=pd.Series(dtc.feature_importances_, index=train_X.columns)
fn=fi.sort_values(ascending=True)
fn.plot(kind='barh', color='r', figsize=(10, 6))
plt.xlabel('importance', size=10)
plt.title('Decision Tree Importance', size=15)
plt.tick_params(labelsize=10)
In [115]:
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO  
from IPython.display import Image  
import pydotplus

#feature_cols = ['deposit_type_Non Refund', 'agent']
dot_data = StringIO()
export_graphviz(dtc, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = x_new,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
Out[115]:

Comparing to Logistic regression, the prediction accuracy and area under curve had a significant improvement.

The importance chart shows that clearly deposite_type has the highest importance in this model, which makes sense because people pay a lot attention on whether their reservation can be refunded or not. And this is relevant to uncertainty people have about this booking. But I didn't think of this feature when doing data visualization. Model can always bring people hiden insights.

Last Thoughts:

Not removing features may improve the accuracy becuase deleting any feature always cause data leakage. We can use PCA to replace step backward feature selection for reducing the dimension.

In order to avoid overfitting in decision tree, we tried to pruning the decision tree by setting limits to max_leaf_nodes and max_depth. Also, We can try Random Forest Classifier for this problem, or other emsemble tree models.

For hotel businesses, they can pay more attention on the following aspects inspired by our results:

  • A high cancelation rate may occur when the deposit can be refunded. But this does not means that all the hotel shouldn't refund reservations. Businesses have to balance it and conduct more research on whether non-refundable booking will lead to less bookings.
  • Feature of agent (ID of the travel agency that made the booking) also contributes a lot to the model.
  • Lead_time: this model matches our prior visualization analysis result that there is a relation between cancelation and lead time. Lead time is longer, cancelation rate is higher.
Created with Jupyter, delivered by Xia Fu.
In [ ]: